Owing to rapid technological advances, including multimedia and mobile phones, Telugu character
recognition (TCR) has recently become a popular research area. Although many studies have focused on
offline TCR models, automated and intelligent online TCR models are still needed. This paper presents the
construction of a Telugu character dataset and its validation using an Inception and ResNet-based model.
The dataset comprises 645 letters: 18 Achus, 38 Hallus, 35 Othulu, 34×16 Guninthamulu, and 10 Ankelu.
The proposed technique aims to recognize and identify distinctive Telugu characters online efficiently.
The model's main pre-processing steps are normalization, smoothing, and interpolation. Recognition
performance can be improved by using stochastic gradient descent (SGD) to optimize the model's
hyperparameters.

1. INTRODUCTION

Character recognition, a subfield of pattern recognition, is one of the key modules in most optical
character recognition (OCR) systems. The process typically starts with feature extraction and ends with
classification [1]. Feature extraction converts a segmented character image into a real-valued feature
vector that describes the character in the image more accurately. Features that capture the characteristics
distinguishing the participating classes help classifiers build models with clearly defined decision
boundaries [2]. The feature extraction algorithm therefore significantly influences the accuracy of the
character recognition process. Segmented character images occasionally show the character in a slightly
translated, rotated, or deformed state [3]. Accurate character recognition is achievable when the document
images are of high quality. In real-time applications, document quality is essential and greatly affects how
well the recognition system works [4]. The key factors determining document image quality are the
characteristics of the document used, the contents of the document, and document deterioration, all of
which subsequently affect recognition. Depending on the type of paper used, characters may appear broken
or touching, especially in printed documents [5].

Non-standard fonts cause uneven spacing between characters in printed documents and lead to touching
glyphs. In handwritten documents, the content is created by a human, so the uniformity of spacing and the
smoothness of the strokes influence the document's quality [6]. Another problem with handwritten
documents is that some people simplify complex glyphs of the script while writing them, which reduces
inter-class diversity and increases intra-class variability [7]. Document aging and digitization may
introduce many types of distortion, in addition to problems with the paper used and defects that occurred
during printing or writing. Because of the distortions and degradations in the document images, some
characters may not be recognized correctly even by humans without contextual information [8]-[10].
Adopting a verification technique can be beneficial for accuracy-sensitive applications in which recognition
errors may be costly. The verification procedure assesses the performance of the classifier or recognizer
and produces a trustworthy acceptance or rejection of the input pattern [11], [12].

Even in the presence of distortions, the recognition system should identify the character correctly. To
develop a powerful character recognizer, the character image must be represented efficiently in an
invariant form. Deep learning (DL) approaches are state of the art in most pattern recognition
applications. Convolutional neural network (CNN) architectures based on DL can learn invariant feature
descriptors through subsampling and trainable filter banks [13]-[15].


2. RELATED WORK 

Historical texts usually contain a large number of dispersed characters, which makes it difficult to
localise and distinguish them using proposal-based and regression-based techniques. Yang et al. [16]
proposed an approach known as the recognition guided detector (RGD), which detects Chinese characters
precisely in old texts. The RGD consists of two concurrently trained CNNs: a recognition network that
provides contextual information about the text, and a detection network that uses this information to
localize each character precisely. Two datasets with character-level annotations, built from scanned
copies of the Tripitaka, were constructed to train and test the method. The reported text recognition
accuracy is 97.25%.

Although text classification based on natural language processing (NLP) has shown promising results and
has a wide range of potential practical applications, including clinical value, NLP for Chinese electronic
medical records (CEMRs) has received less attention than NLP for English records. Most of the available
CEMRs are non-standardized texts with loose grammar, low utilization rates, and a tendency to mix patient
symptoms, prescriptions, diagnoses, and other critical information. Zhang et al. [17] proposed a capsule
network model for electronic medical record classification that uses a unique routing architecture and
combines long short-term memory (LSTM) and gated recurrent unit (GRU) models to extract complex medical
text features. The model outperforms the baseline models by at least 4.1% and achieves an F1 score of
73.51% on the CEMRs dataset.

Given the ubiquity of handwritten documents in interpersonal communication, character recognition (CR)
of documents has enormous practical utility. OCR allows many different types of images to be transformed
into editable, searchable, and analysable data. Over the past ten years, researchers have been digitizing
printed and handwritten texts using artificial intelligence (AI) and machine learning (ML) techniques.
Memon et al. [18] surveyed character recognition research to make recommendations for future
investigation. They followed a predetermined review protocol, searched commonly used electronic
databases, and described the study selection process in depth. Their systematic literature review (SLR)
covers 176 selected articles. In India, the bulk of the population speaks Hindi; however, most signboards
are written in English. Travellers on business or leisure trips can be bewildered by the numerous
English-language signboards. They can rely on mobile phones, which have grown in popularity in recent
years, for assistance.

Arafat et al. [19] developed a mobile application that recognises the English text and symbols in a
signboard image, translates them from English to Hindi, and displays the translated Hindi text on the
phone's screen. The system uses an English-to-Hindi lexicon for translation, a pre-trained faster
region-based CNN (Faster R-CNN) for object detection, and Tesseract OCR for text extraction. A piping and
instrumentation diagram is a means of communicating decisions regarding the precise design, acquisition,
procurement, construction, and commissioning of a plant.

Kim et al. [20] provide a DL-based solution to the problem of recognizing text and symbols in piping and
instrumentation diagram (P&ID) images. Their methodology includes pre-processing the P&ID images and
storing the recognition results. It also addresses how to identify symbols of different sizes and
complexity in high-density images. When the trained model was tested on their dataset, the results were
good: precision and recall were 0.9718 and 0.9827 for symbols and 0.9386 and 0.9175 for text, respectively.
The rapidly growing volume of financial tickets is putting financial accountants under increasing strain
and consuming an unnecessary amount of personnel effort.

Zhang et al. [21] propose an architecture for a financial ticket intelligent recognition system (FTIRS)
that iteratively self-learns to address this problem. A practical financial accounting system must allow
iterative updating and a flexible algorithm model, both of which this framework supports. To increase its
effectiveness and efficiency further, the authors developed an intelligent financial ticket data warehouse.
The system can presently distinguish between 482 types of financial tickets and has an autonomous
iterative optimization process. As a result, the number of ticket types that the system can recognise, and
its accuracy, are expected to grow the longer the system is in use. The system's business value has been
established: it can greatly boost financial accounting efficiency while lowering the cost of recruiting
accounting staff. Besides Latin, Arabic, Chinese, and Hindi are also written in cursive forms, and Urdu is
another such script. Detecting specific Urdu ligatures in natural scene photos and localizing them is
challenging.

In the method proposed by Arafat and Iqbal [22], Urdu ligatures are detected, their orientation is
predicted, and they are recognised in outdoor images. For detection, a customised faster region-based CNN
(Faster R-CNN) was combined with SqueezeNet, GoogLeNet, ResNet18, and ResNet50. A two-stream deep
neural network (TSDNN) was employed to recognise the ligatures. The common learning environment (CLE)
annotation text was used to produce five datasets containing 4.2K and 51K synthetic images with embedded
Urdu text, which were used to evaluate a variety of functions including ligature detection. Additionally,
the TSDNN was tested using 1,094 real images containing more than 12K Urdu characters. The authors
evaluated and compared the ability of all four detectors to locate Urdu text using average precision (AP).
With an AP of 0.98, the Faster R-CNN based on ResNet50 features was found to be the best detector.

Oliveira et al. [23] addressed the problem of vehicle identification across non-overlapping cameras.
Their main contribution is Vehicle-Rear, a novel dataset for vehicle identification. To exploit the
dataset, their two-stream CNN uses the vehicle's appearance and licence plate, two of the most
recognisable and persistent attributes available. This work addresses a serious problem: false alarms
caused by vehicles with identical designs or very similar plates [24]. In the first stream, a Siamese CNN
detects shape similarities by comparing two low-resolution vehicle patches taken by two cameras. In the
second stream, a CNN processes two high-resolution licence plate patches. To reach a decision, a set of
fully connected layers combines the features from both streams. State-of-the-art OCR work is summarized
in Table 1.


Table 1. OCR as per state-of-art 


| Ref. | Pre-processing | Segmentation | Feature extraction | Optimizer | Novelty/technique | Post processing |
|---|---|---|---|---|---|---|
| [24] | Residual identity block | Image segmentation | Gabor filter | Kernel self-optimization Fisher classifier | J&M model | --- |
| [25] | Residual identity block | --- | CNN feature extractor | Momentum optimizer | Hippocampus-heuristic character recognition network (HCRN), pseudo-Siamese network | --- |
| [26] | Bilinear interpolation technique | --- | ResNet and SENet | Meta-optimization | Customized recurrent neural network (CRNN) | Custom gated recurrent units technique |
| [27] | Niblack's method | --- | Hamming network (Hamming subnet, CNN subnet) | Winner takes all (WTA) algorithm | Complementary similarity measure (CSM) method, convolutional neural networks | Similarity measure neural network (SMNN) |
| [28] | Data augmentation technology | Dropout and batch normalization | Separable convolution | Back propagation algorithm | ResNet's short connection structure | --- |
| [29] | Binarization, skew correction, and noise removal | Projection segmentation | Novel adaptive sliding window | MLLR-based HMM adaptation techniques | Hidden Markov models | Viterbi decoding |
| [30] | Dilation and erosion | --- | CNN (AlexNet) | Error-correcting output codes (ECOC) | CNN method | Cross validation |
| [31] | Scaling, noise reduction, centering, slanting, and skew estimation | Vertical projection profile analysis | Kernel method | Gradient descent optimization algorithm and Adam optimizer | DS-CapsNet and Caps-SoftPool method | Viterbi algorithm |


3. TELUGU LETTER DATASET 

The Telugu language has a total of 645 letters, i.e., 18 Achus, 38 Hallus, 35 Othulu, 34×16
Guninthamulu, and 10 Ankelu, as shown in Figure 1. A dedicated dataset with 645 classes was designed for
the Telugu language. Each category (Achus, Hallus, Othulu, and Guninthamulu) should contain a diverse
range of examples, for instance different writing styles, sizes, fonts, and orientations. Variability across
categories can also be introduced to mimic real-world scenarios, for example by including combinations of
Achus, Hallus, Othulu, and Guninthamulu in words to simulate how characters are used together.
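The stated class count can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary and variable names are hypothetical, and the Guninthamulu entry simply reads "34×16" as a product.

```python
# Category sizes as stated in the text; the names are illustrative only.
category_sizes = {
    "Achus": 18,
    "Hallus": 38,
    "Othulu": 35,
    "Guninthamulu": 34 * 16,  # stated as 34x16 in the text
    "Ankelu": 10,
}

# Total number of classes the recognizer has to distinguish.
total_classes = sum(category_sizes.values())
print(total_classes)
```

The Guninthamulu account for the large majority of the classes, so any class imbalance in that category would dominate the dataset.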


Figure 1. Telugu letters considered for the dataset


4. PARALLEL CONVOLUTIONAL NEURAL NETWORKS 

Deep CNNs (DCNNs) are the current standard in computer vision (CV). Despite all the work put into
creating complex convolutional structures, it remains unclear how distinct the best CNNs are from one
another. CNNs consistently place first in object recognition competitions and have been used for various
visual tasks, including pose estimation, segmentation, object detection and localization, and visual
saliency [25]. CNNs are a framework from which numerous designs can be instantiated rather than a single
design, which makes them less than ideal as a cognitive architecture. Rather than comparing recognition
rates, we compare CNNs by how similarly their final classification layers use the features to identify
images.


4.1. ResNet 

In contrast to Inception, ResNet has a simpler, single-scale processing unit, uses many layers, and
allows data to skip across layers. A large portion of the CV work of the past five years may be viewed as a
race between research teams to develop the most effective vision architecture within the deep network
framework. ResNet was created at Microsoft in 2016 and subsequently improved upon [26]. More recent
architectures, including PNASNet, have improved on its results only marginally.

Residual blocks, which consist of two convolutional layers and a non-parameterized shortcut connection
that passes the block's input unaltered to its output, as shown in Figure 2, were introduced in 2015 [27].
A 152-layer ResNet significantly improved the state-of-the-art performance on the ImageNet recognition
challenge and showed that adding layers keeps improving recognition accuracy. Improvements have been
shown to continue beyond 1,000 layers, which was previously impossible.

ResNet was improved further as ResNet-v2 in 2016 [28]. This straightforward structural element has since
been incorporated successfully into numerous other DCNNs. The layer configurations of the ResNet variants
are listed in Table 2. Inception modules, by contrast, concatenate the outputs of convolutional layers of
various sizes and of max pooling, all computed from the same input.


Figure 2. ResNet shortcut connections
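The shortcut connection of Figure 2 can be sketched in a few lines. This is a simplified illustration, not the implementation used in the paper: small dense matrices stand in for the convolutional weight layers, and the function names and example weights are hypothetical.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F(x) = w2 @ relu(w1 @ x).

    Dense layers replace the convolutional weight layers of Figure 2
    purely to keep the example short.
    """
    fx = matvec(w2, relu(matvec(w1, x)))  # residual branch F(x)
    return relu([f + xi for f, xi in zip(fx, x)])  # shortcut adds x unchanged


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With all-zero weights F(x) = 0, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))
# With identity weights F(x) = x, so the output is 2x.
print(residual_block(x, identity, identity))
```

Because the shortcut carries the input forward even when the weight layers contribute nothing, stacking more blocks does not have to degrade the signal, which is the intuition behind training very deep ResNets.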


Table 2. Layers in ResNet 


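| Conv layer | Output size | 18 layers | 34 layers | 50 layers | 101 layers | 152 layers |
|---|---|---|---|---|---|---|
| 1 | 112×112 | 7×7, 64, stride 2 | 7×7, 64, stride 2 | 7×7, 64, stride 2 | 7×7, 64, stride 2 | 7×7, 64, stride 2 |
| 2_x | 56×56 | 3×3 max pool, stride 2; [3×3, 64; 3×3, 64] ×2 | 3×3 max pool, stride 2; [3×3, 64; 3×3, 64] ×3 | 3×3 max pool, stride 2; [1×1, 64; 3×3, 64; 1×1, 256] ×3 | 3×3 max pool, stride 2; [1×1, 64; 3×3, 64; 1×1, 256] ×3 | 3×3 max pool, stride 2; [1×1, 64; 3×3, 64; 1×1, 256] ×3 |
| 3_x | 28×28 | [3×3, 128; 3×3, 128] ×2 | [3×3, 128; 3×3, 128] ×4 | [1×1, 128; 3×3, 128; 1×1, 512] ×4 | [1×1, 128; 3×3, 128; 1×1, 512] ×4 | [1×1, 128; 3×3, 128; 1×1, 512] ×8 |
| 4_x | 14×14 | [3×3, 256; 3×3, 256] ×2 | [3×3, 256; 3×3, 256] ×6 | [1×1, 256; 3×3, 256; 1×1, 1024] ×6 | [1×1, 256; 3×3, 256; 1×1, 1024] ×23 | [1×1, 256; 3×3, 256; 1×1, 1024] ×36 |
| 5_x | 7×7 | [3×3, 512; 3×3, 512] ×2 | [3×3, 512; 3×3, 512] ×3 | [1×1, 512; 3×3, 512; 1×1, 2048] ×3 | [1×1, 512; 3×3, 512; 1×1, 2048] ×3 | [1×1, 512; 3×3, 512; 1×1, 2048] ×3 |
| | 1×1 | Average pool, 1000-d fc, softmax | Average pool, 1000-d fc, softmax | Average pool, 1000-d fc, softmax | Average pool, 1000-d fc, softmax | Average pool, 1000-d fc, softmax |
| FLOPs | | 1.8×10^9 | 3.6×10^9 | 3.8×10^9 | 7.6×10^9 | 11.3×10^9 |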


Inception is built by combining many such modules. While its authors continued to optimise for
classification performance, the modular architecture underwent several improvements. Important techniques
include 1×1 convolutions, factorising n×n convolutional layers into stacked n×1 and 1×n layers, and batch
normalization, although no single architectural feature accounts for Inception's performance [29].
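The saving from factorising an n×n convolution into stacked n×1 and 1×n layers can be made concrete by counting weights. The sketch below is a back-of-the-envelope illustration; the kernel size and channel width are hypothetical values chosen for the example, and biases are ignored.

```python
def conv_weights(kernel_h, kernel_w, channels_in, channels_out):
    """Number of weights in one convolutional layer, ignoring biases."""
    return kernel_h * kernel_w * channels_in * channels_out


n, channels = 7, 64  # hypothetical kernel size and channel width

# One full n x n convolution.
full = conv_weights(n, n, channels, channels)

# The same receptive field built from an n x 1 layer followed by a 1 x n layer.
factorised = (conv_weights(n, 1, channels, channels)
              + conv_weights(1, n, channels, channels))

print(full, factorised, full / factorised)
```

For equal input and output widths the ratio is n/2, so the saving grows with the kernel size.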


4.2. Inception 
Inception is a CNN that separates processing by scale, merges the results, and repeats. It belongs to
the Inception family of CNNs. Inception's computational cost is also far lower than that of VGGNet or its
higher-performing successors. This has made it possible to use Inception modules, shown in Figure 3, in
big-data scenarios where huge quantities of data must be processed at reasonable cost, or where memory or
computational capacity is inherently limited, such as in mobile vision settings [30].


Figure 3. Inception module
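The "process by scale, then concatenate" idea behind the module in Figure 3 can be illustrated on a one-dimensional signal. This is a toy sketch under strong simplifications: fixed hand-picked kernels stand in for learned 2-D convolutions, and all function names are hypothetical.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding, so the output length equals the input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def maxpool1d_same(signal, size=3):
    """Stride-1 max pooling that keeps the input length."""
    pad = size // 2
    padded = [float("-inf")] * pad + list(signal) + [float("-inf")] * pad
    return [max(padded[i:i + size]) for i in range(len(signal))]


def inception_like_module(signal):
    """Run branches with different receptive fields on the same input and stack them."""
    branches = [
        conv1d_same(signal, [1.0]),                        # width-1 branch
        conv1d_same(signal, [1.0, 1.0, 1.0]),              # width-3 branch
        conv1d_same(signal, [1.0, 1.0, 1.0, 1.0, 1.0]),    # width-5 branch
        maxpool1d_same(signal, 3),                         # pooling branch
    ]
    return branches  # "filter concat": one output channel per branch


channels = inception_like_module([1.0, 2.0, 3.0, 4.0])
for channel in channels:
    print(channel)
```

Every branch sees the same input and produces an output of the same length, which is what allows the results to be concatenated along the channel dimension.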


Telugu letters dataset and parallel deep convolutional neural network with a ... (Josyula Siva Phaniram) 


ISSN: 2089-4864


By using specialized methods to target memory utilization or by using computational heuristics to 
optimize the execution of specific activities, it is possible to partially minimize these concerns. These 
techniques, however, increase complexity. The efficiency difference could also be widened by using similar 
techniques to improve the inception architecture. 


5. OPTIMIZATION 

Several gradient descent (GD) methods are available in the literature that can be used to address the
issue of local minima systematically. This section describes the most frequently used approaches. Batch
gradient descent (BGD) computes the error for each training sample, but the model is updated only after
all the training samples have been evaluated. This entire cycle is called a training epoch [31]. BGD offers
computational efficiency and stable convergence; its drawbacks are that it may converge to a local minimum
and that each epoch requires access to the entire training dataset. In contrast to BGD, stochastic gradient
descent (SGD) computes the error and adjusts the parameters for each training sample individually.
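The difference between the two update schedules can be shown on a toy problem. The sketch below fits a single weight to samples of y = 2x; the data, learning rate, and function names are hypothetical, and BGD is written here with the mean gradient, which is one common convention.

```python
def gradient(w, x, y):
    """Gradient of the squared error 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def batch_gd_epoch(w, data, lr):
    """BGD: accumulate the gradient over all samples, then update once per epoch."""
    total = sum(gradient(w, x, y) for x, y in data)
    return w - lr * total / len(data)


def sgd_epoch(w, data, lr):
    """SGD: update the weight immediately after every sample."""
    for x, y in data:
        w = w - lr * gradient(w, x, y)
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy samples of y = 2x
w_bgd = w_sgd = 0.0
for _ in range(200):
    w_bgd = batch_gd_epoch(w_bgd, data, lr=0.05)
    w_sgd = sgd_epoch(w_sgd, data, lr=0.05)

print(round(w_bgd, 4), round(w_sgd, 4))
```

On this noise-free problem both schedules approach the same weight; the trade-offs discussed in the text appear on larger, noisier datasets, where per-sample updates are cheaper per step but fluctuate more.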

The main advantages of SGD are that it processes data sequentially, is generally faster than BGD, and
gives a detailed view of the rate of improvement. Its drawbacks are that frequent updates are
computationally costly and introduce noise into the gradients, which can hinder convergence. SGD also has
difficulty navigating ravines, which are frequently found near local optima and are defined as regions
where the surface slopes considerably more sharply in one dimension than in another. GD is an optimization
technique for finding a local minimum of a function [32]. In backpropagation, the weights are updated
iteratively to reduce the error function. When training sets are large, the computation time of GD can
increase dramatically because each iteration requires computing the outputs, errors, and gradients of every
sample. In neural networks, SGD is therefore almost always preferred over GD. The weights are updated as
in (1):

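$$W_{ij}^{l}(t+1) = W_{ij}^{l}(t) + \Delta W_{ij}^{l}, \qquad (1)$$

where $\Delta W_{ij}^{l}$ is the weight update and $t$ is the GD iteration. The two most popular SGD
variations are batch learning and online learning. In online learning, the weights are changed for every
input $a^{(k)}$, i.e., $\Delta W_{ij}^{l}$ in (1) is given by (2):

$$\Delta W_{ij}^{l} = -\eta \frac{\partial E^{(k)}}{\partial W_{ij}^{l}} = -\eta\, \delta_{i}^{l(k)} a_{j}^{l-1(k)}, \qquad (2)$$

where $\eta$ is the learning rate, a network hyperparameter that controls how quickly the weights change,
$\delta_{i}^{l(k)}$ is the backpropagated error of unit $i$ in layer $l$, and $a_{j}^{l-1(k)}$ is the
activation of unit $j$ in layer $l-1$. In batch learning, the error and the weight gradients are computed
for a batch of $K$ inputs $A$:

$$\Delta W_{ij}^{l} = -\eta \sum_{k=1}^{K} \frac{\partial E^{(k)}}{\partial W_{ij}^{l}} = -\eta \sum_{k=1}^{K} \delta_{i}^{l(k)} a_{j}^{l-1(k)}. \qquad (3)$$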


A typical problem with gradient-based optimization techniques is becoming trapped in a local minimum
that is far from the desired global minimum. Online learning helps to avoid such local minima because the
stochastic error surface is noisy. Batch learning averages the gradients, which reduces noise but makes it
more likely to become stuck in local minima. However, batch training is frequently favored because it runs
efficiently on modern machines. Additionally, training techniques have been created to avoid local minima,
rendering online learning unnecessary. The hyperparameters considered for SGD optimization are as
follows [33].


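# Keras-style SGD configuration (assumes TensorFlow's Keras API).
from tensorflow.keras import optimizers

# beta_1=0.9, beta_2=0.999, and epsilon=1e-07 appear in the original listing,
# but they are Adam-style arguments that this SGD constructor does not take.
sgd = optimizers.SGD(
    learning_rate=0.01, momentum=0.0, nesterov=False, name="SGD",
)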

The momentum approach in optimization algorithms such as SGD plays a crucial role in mitigating
convergence to local minima during the training of neural networks. It does so by damping the oscillations
in weight updates over successive iterations. By introducing a momentum term, which is essentially a
moving average of past gradients, the algorithm gains inertia and tends to continue in the direction of
the previously accumulated gradients. When the gradient changes direction, as it often does in complex
loss landscapes, the momentum helps the optimizer carry on and escape shallow local minima. In essence,
momentum smooths the optimization path, allowing the algorithm to navigate the optimization space more
efficiently, which speeds up convergence and improves the chances of finding a better minimum.
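The momentum update can be sketched on a one-dimensional quadratic. This is a minimal illustration, assuming the common velocity form (velocity = momentum × velocity − learning rate × gradient); the objective, starting point, and function names are hypothetical and unrelated to the actual training run.

```python
def sgd_momentum_step(w, velocity, grad, lr=0.01, momentum=0.9):
    """One update: velocity = momentum * velocity - lr * grad, then w = w + velocity."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


def minimise(momentum, steps=100):
    """Minimise f(w) = w ** 2 starting from w = 5.0 and return the final w."""
    w, velocity = 5.0, 0.0
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w ** 2
        w, velocity = sgd_momentum_step(w, velocity, grad, momentum=momentum)
    return w


plain = minimise(momentum=0.0)          # ordinary gradient steps
with_momentum = minimise(momentum=0.9)  # accumulated velocity
print(abs(plain), abs(with_momentum))
```

With the same learning rate and step budget, the run with momentum ends closer to the minimum on this toy objective, because the accumulated velocity lengthens steps taken in a consistent direction.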


6. IMPLEMENTATION 

Telugu characters were recognized using a parallel DCNN with an SGD optimizer (PDCNN-SGD), as shown in
Figure 4. Inception and ResNet both extract features from images using a convolutional architecture and
then employ fully connected (FC) classifiers to assign labels to the images based on the extracted
features. The architectures and the number of features extracted by the two systems differ: ResNet produces
2,048 features per image, whereas Inception produces 1,536 [34].
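To give a sense of scale, the sketch below counts the parameters of a single FC softmax layer placed on top of both feature vectors. It assumes, purely for illustration, that the two parallel branches are fused by concatenation; the text does not specify the fusion step, so the fusion rule and all names here are hypothetical.

```python
RESNET_FEATURES = 2048     # per image, as stated in the text
INCEPTION_FEATURES = 1536  # per image, as stated in the text
NUM_CLASSES = 645          # Telugu letter classes in the dataset

# Assumption: the parallel branches are fused by simple concatenation.
fused_features = RESNET_FEATURES + INCEPTION_FEATURES

# Weights plus biases of one hypothetical FC softmax layer over the fused vector.
fc_parameters = fused_features * NUM_CLASSES + NUM_CLASSES
print(fused_features, fc_parameters)
```

Under this assumption the classifier head alone holds a few million parameters, which is small compared with the convolutional backbones.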

However, the features extracted by Inception and ResNet are extremely similar, in that an affine mapping
of one predicts the other. This implies that, despite their structural differences, Inception and ResNet
rely on essentially the same aspects of images. Although they may extract the same features, Inception
seems to do so more robustly than ResNet, and a further review of our results suggests that it may extract
a few additional properties. It is unexpected that ResNet and Inception features are related by affine
transformations.
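What it means for one feature to be predicted by an affine mapping of another can be shown with a least-squares fit. The sketch below uses scalar, synthetic features for brevity; the real feature vectors are 2,048- and 1,536-dimensional and would require a matrix-valued fit, and the data and function names here are hypothetical.

```python
def fit_affine(xs, ys):
    """Least-squares fit of y ~ a * x + b for scalar features."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


def r_squared(xs, ys, a, b):
    """Fraction of the variance in ys explained by the affine prediction."""
    mean_y = sum(ys) / len(ys)
    residual = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - residual / total


feature_a = [0.0, 1.0, 2.0, 3.0, 4.0]
feature_b = [3.0 * x + 1.0 for x in feature_a]  # synthetic, exactly affine in feature_a

a, b = fit_affine(feature_a, feature_b)
score = r_squared(feature_a, feature_b, a, b)
print(a, b, score)
```

An explained-variance score close to 1 on held-out images would support the claim that one feature set is approximately an affine function of the other; a low score would contradict it.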


Figure 4. PDCNN-SGD optimizer for Telugu character recognition (TCR)


Because CNNs can learn intricate non-linear features of images, they have completely changed the field
of computer vision. The structural differences between the systems suggest that the non-linear functions
they compute take different forms. However, because their features are linear transformations of one
another, Inception and ResNet appear to extract similar qualities from images, even though their training
algorithms appear to hill-climb towards solutions in entirely different spaces. This affine relationship
explains the two systems' identical performance and, in our opinion, also has wider ramifications. It
implies that the features extracted by Inception and ResNet are driven by the content of the training
images rather than by the specifics of their neural architectures. If this is the case, many complex CNNs
should behave similarly. In our experiments, the number of training epochs was set to 50 and 100, as
shown in Figures 5 and 6.


Figure 5. Accuracy graph analysis of PDCNN-SGD technique (model accuracy versus epoch)

Figure 6. Loss graph analysis of PDCNN-SGD technique (model loss versus epoch)


Combining ResNet and inception, two prominent DL architectures, offers superior performance due to 
their complementary strengths. ResNet's ability to handle very deep networks and inception's efficient multi-scale 
feature extraction capabilities make them an ideal pair. This ensemble reduces overfitting, enhances 
generalization, and provides increased accuracy, making it a powerful choice for various computer vision tasks, 
including image classification, object detection, and semantic segmentation. According to experimental findings, 
our suggested method performs better and is appropriate for TCR as shown in Figure 7. 


Figure 7. Accuracy of models with different DCNN approaches (accuracy (%) for ResNet, Inception, and
ResNet+Inception)


Experimental findings showed that the suggested hybrid model performs well owing to three
characteristics. First, whereas the success of most traditional classifiers depends mainly on suitable
hand-crafted feature extractors, whose design is laborious and time-consuming, the hybrid model extracts
the prominent features of the images automatically. Second, given that ResNet and Inception are among the
most widely used and effective classifiers, the hybrid model incorporates the strengths of both. Third,
compared with a single CNN classification model, the hybrid model only slightly increases the complexity
of the decision-making process.

7. CONCLUSION

The PDCNN-SGD model, designed for recognizing and classifying Telugu characters in handwritten 
text, is a complex DL architecture consisting of several stages of operations to achieve its objective effectively. 
The dataset comprises a total of 645 letters, including 18 Achus, 38 Hallus, 35 Othulu, 34×16 Guninthamulu,
and 10 Ankelu. In the feature extraction stage, both ResNet and inception architectures are employed to extract 
rich and diverse feature representations from the input data. The SGD-based hyperparameter optimization 
process systematically tunes critical hyperparameters, such as learning rate, batch size, and weight decay, to 
find the optimal configuration for training the model, ensuring its readiness to achieve high accuracy in TCR 
tasks.